The Center for Applied Internet Data Analysis (CAIDA) gathers Internet data and provides it to the scientific research community. In this project, one publicly available dataset, Trace Statistics for CAIDA Passive OC48 and OC192 Traces, was gathered and studied.
The data is a space-separated text file with some comments. After it was read into a data frame, it was found to have the following columns.
## [1] "SIZE" "X.IPv4." "SCTP" "IPv6" "ESP" "UDP"
## [7] "GRE" "ICMP" "TCP" "UNKNOWN" "X.IPv6." "ICMP6"
## [13] "UDP.1" "TCP.1" "X.IPv6t." "ICMP6.1" "UDP.2" "TCP.2"
The dataset consists of 18 variables.
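As a rough illustration of the loading step, here is a minimal Python sketch (not the code used for this report, whose output comes from R). It assumes comment lines start with `#` and that the first non-comment line is the header. The header spelling is hypothetical; the three data rows reuse SIZE, IPv4, UDP, and TCP values from the sample printed in this report.

```python
import io

# Hypothetical excerpt: the '#' comment marker and the header spelling are
# assumptions; the data rows reuse values from the sample in this report.
RAW = """\
# trace statistics (comment lines are skipped)
SIZE IPv4 UDP TCP
988 149025 6963 142062
1137 66396 26690 39706
645 67079 4239 62840
"""

def read_table(handle):
    """Parse whitespace-separated rows, skipping blank and '#' lines."""
    header = None
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if header is None:
            header = fields          # first non-comment line names the columns
        else:
            rows.append(dict(zip(header, map(int, fields))))
    return header, rows

header, rows = read_table(io.StringIO(RAW))
print(header)
print(rows[0])
```

Swapping `io.StringIO(RAW)` for an open file handle would read a real file the same way, provided the file follows the assumed layout.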
Let’s look at five samples of the data.
## SIZE X.IPv4. SCTP IPv6 ESP UDP GRE ICMP TCP UNKNOWN X.IPv6.
## 968 988 149025 0 0 0 6963 0 0 142062 0 1452
## 1117 1137 66396 0 0 0 26690 0 0 39706 0 3180
## 625 645 67079 0 0 0 4239 0 0 62840 0 2509
## 408 428 178456 0 0 90 9961 17 121 168267 0 1303
## 949 969 109767 0 0 0 8116 0 0 101651 0 1608
## ICMP6 UDP.1 TCP.1 X.IPv6t. ICMP6.1 UDP.2 TCP.2
## 968 0 0 1452 0 0 0 0
## 1117 0 0 3180 0 0 0 0
## 625 0 16 2493 0 0 0 0
## 408 0 11 1292 0 0 0 0
## 949 0 1 1607 0 0 0 0
Semantically, SIZE is different from the other variables. It is the size of a packet, while each of the other variables holds the number of packets of a given type at that size.
SIZE is quite uniform. Between close to 0 and 1500, each size appears exactly once. This implies that packets of almost every size in that range did appear. Let’s look at the 10 smallest and 10 largest sizes.
## [1] 21 22 23 24 25 26 27 28 29 30
## [1] 1498 1499 1500 1504 1632 1668 1740 1844 1848 4003
The minimum SIZE is 21. The uniform behavior does not hold once SIZE goes beyond 1500. Larger sizes show up only sparsely, with one extreme outlier at 4003.
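The check above can be sketched in Python. The list of sizes is a reconstruction, not the raw data: it assumes every integer from 21 to 1500 occurs exactly once, followed by the seven larger values printed above. That assumption is at least consistent with the report, since the reconstructed sizes sum to 1139779, the SIZE column sum printed later.

```python
# Reconstruction (assumption): every integer size from 21 to 1500 occurs once,
# followed by the seven larger sizes printed above.
sizes = list(range(21, 1501)) + [1504, 1632, 1668, 1740, 1844, 1848, 4003]

ordered = sorted(sizes)
print(ordered[:10])    # ten smallest sizes
print(ordered[-10:])   # ten largest sizes

# Sizes above 1500 form the sparse tail.
tail = [s for s in ordered if s > 1500]
print(len(ordered), len(tail), sum(ordered))
```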
Now check the distribution of all the other “type” variables.
On a linear scale, the data is highly concentrated at low values, with what might be a few large outliers. Let’s look at a log scale.
In terms of amount of data, TCP, TCP.1, UDP, UDP.1, X.IPv4., and X.IPv6. appear to have much more data than the other types, which suggests looking at these variables first in the following analysis.
SIZE is quite different from all the others: semantically it is the packet size, while all other variables are about packet types. The relationships to explore will therefore most likely be between SIZE and the other 17 variables.
Let’s choose the first two variables to see what they look like.
For the first pair of variables, a small number of outliers on the normal (linear) scale makes the main structure of the data difficult to examine.
Also, the points appear to be concentrated, so transparency should be used.
With log10 scale on the packet counts “X.IPv4.” and alpha of 1/5 applied, the main structure of the dataset is much easier to comprehend.
Packet size also has outliers beyond 1500, and even beyond 2000.
Eliminating the data where size is more than 2000 zooms in on the distribution. Now this looks like a slight negative correlation.
A linear regression line gives a better view of that negative correlation.
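To make the regression step concrete, here is a minimal ordinary least-squares sketch in Python. It assumes the line is fitted to log10 packet counts against size, after dropping sizes above 2000. The five (SIZE, X.IPv4.) pairs come from the sample rows shown earlier, so the fitted numbers only illustrate the mechanics and are not the slope for the full dataset.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# (SIZE, X.IPv4.) pairs from the five sampled rows shown earlier.
sample = [(988, 149025), (1137, 66396), (645, 67079), (428, 178456), (969, 109767)]

# Keep sizes up to 2000 and positive counts, then regress log10(count) on size.
kept = [(s, c) for s, c in sample if s <= 2000 and c > 0]
slope, intercept = least_squares([s for s, _ in kept],
                                 [math.log10(c) for _, c in kept])
print(slope, intercept)
```

A negative slope here means the log10 count falls as size grows, which is the direction of the trend described above.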
Now we choose another variable X.IPv6.
“X.IPv6.” shows similar slight negative correlation to Size.
Same negative correlation for IPv6t (IPv6 Tunneled) packets.
Previously we eliminated some data where SIZE is larger than 2000. Is there a way to get a good visualization without cutting data?
Instead of cutting the outliers with SIZE above 2000, using a log10 scale on SIZE as well also gives a good visualization. This should be the preferred method, since at this point we don’t yet know whether cutting data would cut off important information.
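The effect of the log10 scale can be sketched as follows. This is an illustrative Python helper, not the plotting code behind the figures. The counts in `pairs` are made up for the example, and skipping zero counts is an assumption about how to handle values where log10 is undefined (several columns in this dataset have a minimum of 0).

```python
import math

def log10_points(pairs):
    """Map (size, count) pairs to (log10 size, log10 count), skipping zeros."""
    return [(math.log10(s), math.log10(c)) for s, c in pairs if s > 0 and c > 0]

# Three sizes from the report paired with made-up counts (assumption).
pairs = [(21, 1000), (1500, 10), (4003, 0)]
points = log10_points(pairs)
print(points)

# On a log10 axis the outlier size 4003 sits well under half a decade
# to the right of 1500, so it no longer stretches the plot.
print(math.log10(4003) - math.log10(1500))
```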
Let’s do the same log10 instead of cutting for the other two examples examined above.
Good. IPv6 can be visualized well using log10 instead of cutting.
The same holds for IPv6t.
Now let’s look at all variables to determine how we want to examine them.
## SIZE X.IPv4. SCTP
## Min. : 21.0 Min. : 1 Min. :0.0000000
## 1st Qu.: 392.5 1st Qu.: 55732 1st Qu.:0.0000000
## Median : 764.0 Median : 88785 Median :0.0000000
## Mean : 766.5 Mean : 1180318 Mean :0.0006725
## 3rd Qu.:1135.5 3rd Qu.: 210714 3rd Qu.:0.0000000
## Max. :4003.0 Max. :446257816 Max. :1.0000000
## IPv6 ESP UDP GRE
## Min. : 0.000 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 7558 1st Qu.: 0.0
## Median : 0.000 Median : 0 Median : 13430 Median : 1.0
## Mean : 1.116 Mean : 1902 Mean : 130137 Mean : 164.3
## 3rd Qu.: 0.000 3rd Qu.: 0 3rd Qu.: 31028 3rd Qu.: 4.0
## Max. :505.000 Max. :442389 Max. :71018929 Max. :203249.0
## ICMP TCP UNKNOWN
## Min. : 0.0 Min. : 0 Min. : 0.0000
## 1st Qu.: 0.0 1st Qu.: 42163 1st Qu.: 0.0000
## Median : 0.0 Median : 71155 Median : 0.0000
## Mean : 1653.6 Mean : 1046460 Mean : 0.6765
## 3rd Qu.: 23.5 3rd Qu.: 149548 3rd Qu.: 0.0000
## Max. :272798.0 Max. :446027466 Max. :360.0000
## X.IPv6. ICMP6 UDP.1
## Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1226 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 1853 Median : 0.0 Median : 3.0
## Mean : 37672 Mean : 142.7 Mean : 571.5
## 3rd Qu.: 3633 3rd Qu.: 0.0 3rd Qu.: 18.0
## Max. :22721559 Max. :109201.0 Max. :178140.0
## TCP.1 X.IPv6t. ICMP6.1
## Min. : 0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 1200 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 1762 Median : 0.000 Median : 0.0000
## Mean : 36958 Mean : 1.024 Mean : 0.4217
## 3rd Qu.: 3446 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :22721559 Max. :505.000 Max. :505.0000
## UDP.2 TCP.2
## Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.0000
## Mean : 0.1231 Mean : 0.4795
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :115.0000 Max. :285.0000
Certain types of packets dominate. Some types of packets have very small counts.
## SIZE X.IPv4. SCTP IPv6 ESP UDP
## 1139779 1755133537 1 1660 2828143 193513127
## GRE ICMP TCP UNKNOWN X.IPv6. ICMP6
## 244369 2458952 1556086279 1006 56018871 212226
## UDP.1 TCP.1 X.IPv6t. ICMP6.1 UDP.2 TCP.2
## 849794 54956851 1523 627 183 713
Summing the packet counts across all values of SIZE confirms the domination of certain protocols. For example, X.IPv4., UDP, TCP, X.IPv6., UDP.1, and TCP.1 have far larger values than the others.
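The column sums also allow a quick consistency check, sketched here in Python. The grouping of columns is an assumption inferred from the column order (each total column followed by its member protocols). The numbers are copied from the output above.

```python
# Column sums copied from the output above.
sums = {
    "X.IPv4.": 1755133537, "SCTP": 1, "IPv6": 1660, "ESP": 2828143,
    "UDP": 193513127, "GRE": 244369, "ICMP": 2458952, "TCP": 1556086279,
    "UNKNOWN": 1006,
    "X.IPv6.": 56018871, "ICMP6": 212226, "UDP.1": 849794, "TCP.1": 54956851,
    "X.IPv6t.": 1523, "ICMP6.1": 627, "UDP.2": 183, "TCP.2": 713,
}

# Assumed grouping: each total column is followed by its member protocols.
groups = {
    "X.IPv4.": ["SCTP", "IPv6", "ESP", "UDP", "GRE", "ICMP", "TCP", "UNKNOWN"],
    "X.IPv6.": ["ICMP6", "UDP.1", "TCP.1"],
    "X.IPv6t.": ["ICMP6.1", "UDP.2", "TCP.2"],
}

for total, members in groups.items():
    member_sum = sum(sums[m] for m in members)
    tcp = next(m for m in members if m.startswith("TCP"))
    tcp_share = round(100 * sums[tcp] / sums[total], 1)
    # Print whether the members add up to the total, and the TCP share in percent.
    print(total, member_sum == sums[total], tcp_share)
```

Under this grouping, each total equals the sum of its member columns, which supports treating the total columns as redundant with the per-protocol columns.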
The data consists of 18 variables. The first variable, “SIZE”, is the size of the packet and is used as the index. Observing the first count variable, “X.IPv4.”, revealed two characteristics: a small number of outliers hides the main structure on a linear scale, which a log10 scale fixes, and the points are heavily concentrated, which calls for transparency.
With the above modifications and an added linear regression, there appears to be a negative correlation between packet size and packet count for IPv4.
Doing the same visualization on IPv6 (“X.IPv6.”) and IPv6 Tunnel (“X.IPv6t.”) also shows a similar negative correlation. In terms of packet counts, IPv4 leads, IPv6 is second, and IPv6 Tunnel has only a very small amount. This is in line with the landscape of the Internet: IPv4 is the primary protocol in use, while IPv6 is the newer protocol and the transition is still happening. IPv6 Tunnel is a temporary technology used only during the transition period.
Further observation found that although the data supplies 18 variables, the data is not tidy. Each column represents a type of packet. Melting the types into a single variable makes analysis across types easier. Further, the types of packets fall under three groups: IPv4, IPv6, and IPv6 Tunnel (IPv6t).
ICMP, UDP, and TCP are the most important upper-layer protocols for all types of IP packets. Other protocols such as SCTP, ESP, and GRE only have data in the IPv4 group and are not as important, as evidenced by their total packet counts compared to those of ICMP, UDP, and TCP. Focusing the comparison on ICMP, UDP, and TCP, as well as the total within each group, should provide better insight.
The data was wrangled to have two new categorical variables with a hierarchical relationship. TYPE is the inner variable and takes the value “TOTAL”, “TCP”, “UDP”, or “ICMP”. GROUP is the outer variable and takes the value “IPv4”, “IPv6”, or “IPv6t”. Finally, PACKET_COUNTS contains the original value. SIZE is retained.
## [1] "SIZE" "TYPE" "PACKET_COUNTS" "GROUP"
After the data wrangling, the number of variables is reduced to four.
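The reshaping can be sketched in Python as below. This is an illustration rather than the wrangling code used for this report. The mapping from wide column names to (GROUP, TYPE) is an assumption inferred from the column order, although it agrees with the per-group statistics reported later. The input is the SIZE 428 row from the sample printed earlier, restricted to the mapped columns.

```python
# Assumed mapping from wide column name to (GROUP, TYPE), inferred from the
# column order; it is not taken from the author's code.
COLUMN_MAP = {
    "X.IPv4.": ("IPv4", "TOTAL"), "TCP": ("IPv4", "TCP"),
    "UDP": ("IPv4", "UDP"), "ICMP": ("IPv4", "ICMP"),
    "X.IPv6.": ("IPv6", "TOTAL"), "TCP.1": ("IPv6", "TCP"),
    "UDP.1": ("IPv6", "UDP"), "ICMP6": ("IPv6", "ICMP"),
    "X.IPv6t.": ("IPv6t", "TOTAL"), "TCP.2": ("IPv6t", "TCP"),
    "UDP.2": ("IPv6t", "UDP"), "ICMP6.1": ("IPv6t", "ICMP"),
}

def melt(wide_rows):
    """Turn one dict per SIZE into one dict per (SIZE, GROUP, TYPE)."""
    long_rows = []
    for row in wide_rows:
        for column, (group, ptype) in COLUMN_MAP.items():
            long_rows.append({"SIZE": row["SIZE"], "TYPE": ptype,
                              "PACKET_COUNTS": row[column], "GROUP": group})
    return long_rows

# The SIZE 428 row from the sample printed earlier (mapped columns only).
wide = [{"SIZE": 428, "X.IPv4.": 178456, "TCP": 168267, "UDP": 9961, "ICMP": 121,
         "X.IPv6.": 1303, "TCP.1": 1292, "UDP.1": 11, "ICMP6": 0,
         "X.IPv6t.": 0, "TCP.2": 0, "UDP.2": 0, "ICMP6.1": 0}]
long_rows = melt(wide)
print(len(long_rows))
print(long_rows[0])
```

Each wide row becomes twelve long rows, one per combination of the three groups and four types.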
## SIZE TYPE PACKET_COUNTS GROUP
## 12993 1117 TOTAL 0 IPv6t
## 7481 66 TCP 141 IPv6
## 17131 794 ICMP 0 IPv6t
## 1201 1221 TOTAL 47411 IPv4
## 16776 439 ICMP 0 IPv6t
As the sample above shows, SIZE and PACKET_COUNTS retain most of the data. The 17 variables used to represent the type of packets are now identified by a combination of two hierarchical, categorical variables: GROUP and TYPE.
The negative correlation between size and packet counts is consistent with the single-variable analysis. All types of packets have a similar negative correlation, although to differing degrees.
## # A tibble: 4 x 4
## TYPE Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 TOTAL 1180318. 88785 1755133537
## 2 TCP 1046460. 71155 1556086279
## 3 UDP 130137. 13430 193513127
## 4 ICMP 1654. 0 2458952
For IPv4, the majority are TCP packets (about 89% of the total by sum), with a fraction of UDP packets (about 11%) and far fewer ICMP packets (about 0.1%).
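A per-type summary like the table above can be sketched in Python with the standard library. The `summarise` helper is hypothetical and only mirrors the Mean, Median, and Sum columns. The records are the IPv4 TCP and UDP counts for three of the sampled sizes shown earlier (988, 1137, 645), so the numbers will not match the full-data table.

```python
from collections import defaultdict
from statistics import mean, median

def summarise(records, key):
    """Group records by `key`; return {value: (mean, median, sum)} of PACKET_COUNTS."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec["PACKET_COUNTS"])
    return {k: (mean(v), median(v), sum(v)) for k, v in buckets.items()}

# IPv4 counts for three of the sampled sizes shown earlier.
records = [
    {"TYPE": "TCP", "PACKET_COUNTS": 142062}, {"TYPE": "UDP", "PACKET_COUNTS": 6963},
    {"TYPE": "TCP", "PACKET_COUNTS": 39706}, {"TYPE": "UDP", "PACKET_COUNTS": 26690},
    {"TYPE": "TCP", "PACKET_COUNTS": 62840}, {"TYPE": "UDP", "PACKET_COUNTS": 4239},
]
for ptype, stats in summarise(records, "TYPE").items():
    print(ptype, stats)
```

Passing `"GROUP"` as the key on records that carry a GROUP field would give the per-group summaries in the same way.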
IPv6 UDP appears to have different relationship than other types.
## # A tibble: 4 x 4
## TYPE Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 TOTAL 37672. 1853 56018871
## 2 TCP 36958. 1762 54956851
## 3 UDP 571. 3 849794
## 4 ICMP 143. 0 212226
For IPv6 as well, the majority are TCP packets (about 98% of the total by sum), with a small fraction of UDP packets and even fewer ICMP packets.
Although the packet counts are far lower, IPv6t appears to have a structure similar to IPv4.
## # A tibble: 4 x 4
## TYPE Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 TOTAL 1.02 0 1523
## 2 TCP 0.479 0 713
## 3 UDP 0.123 0 183
## 4 ICMP 0.422 0 627
The statistics show that IPv6t has insignificant counts compared to IPv4 and IPv6.
The total packet counts appear to have a similar structure of slight negative correlation.
## # A tibble: 3 x 4
## GROUP Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 IPv4 1180318. 88785 1755133537
## 2 IPv6 37672. 1853 56018871
## 3 IPv6t 1.02 0 1523
The statistics show that IPv4 far exceeds IPv6 (by a factor of about 31 in total packets), with IPv6t negligible.
TCP has the same slight negative correlation.
## # A tibble: 3 x 4
## GROUP Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 IPv4 1046460. 71155 1556086279
## 2 IPv6 36958. 1762 54956851
## 3 IPv6t 0.479 0 713
Same IPv4 >> IPv6 >> IPv6t in terms of counts.
Both IPv4 UDP and IPv6 UDP behave quite differently from the others. More on this in the analysis section.
## # A tibble: 3 x 4
## GROUP Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 IPv4 130137. 13430 193513127
## 2 IPv6 571. 3 849794
## 3 IPv6t 0.123 0 183
## # A tibble: 3 x 4
## GROUP Mean Median Sum
## <fct> <dbl> <int> <int>
## 1 IPv4 1654. 0 2458952
## 2 IPv6 143. 0 212226
## 3 IPv6t 0.422 0 627
This section attempts to combine all variables to discover the relationships between them.
Plotting one group (IPv4) as points and another group (IPv6) as a line shows the difference in scale of packet counts between them.
Use a point plot colored by Type and shaped by Group.
Reversing the color and shape, so that the plot is colored by Group and shaped by Type, gives better visibility because of the difference in scale between groups.
Given that the TOTAL type is redundant information (it is just the sum of all types combined), removing it could make the plot less crowded. This is colored by Type and shaped by Group.
The same with the TOTAL type removed, then colored by Group and shaped by Type.
## SIZE TYPE PACKET_COUNTS GROUP GROUP_TYPE
## 12952 1076 TOTAL 0 IPv6t IPv6t TOTAL
## 16420 83 ICMP 0 IPv6t IPv6t ICMP
## 388 408 TOTAL 123047 IPv4 IPv4 TOTAL
## 5852 1411 ICMP 0 IPv4 IPv4 ICMP
## 1946 479 TCP 193139 IPv4 IPv4 TCP
Concatenated the two categorical variables Group and Type into one.
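The concatenation itself is a one-liner. A minimal Python sketch, assuming the two values are joined with a single space as in the GROUP_TYPE column above (the helper name is hypothetical, and the two records come from the sample above):

```python
def add_group_type(records):
    """Return copies of the records with GROUP and TYPE joined by a space."""
    return [{**rec, "GROUP_TYPE": f"{rec['GROUP']} {rec['TYPE']}"} for rec in records]

# Two rows taken from the sample above.
records = [
    {"SIZE": 408, "TYPE": "TOTAL", "PACKET_COUNTS": 123047, "GROUP": "IPv4"},
    {"SIZE": 83, "TYPE": "ICMP", "PACKET_COUNTS": 0, "GROUP": "IPv6t"},
]
for rec in add_group_type(records):
    print(rec["GROUP_TYPE"])
```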
The plot with concatenated group and type variables.
Three methods were attempted to plot two categorical variables on top of the x and y scales of two numeric variables.
Of all the graphing and analysis done, three major findings are summarized in different plots.
By tuning the colors to be unique for each type while using an orange color family for IPv4, green for IPv6, and blue for IPv6t, the unique characteristics of each type of packet in each group become visually more pronounced in a log10-scale graph of both packet size and packet counts.
The plot shows that IPv4 has more packets than IPv6, and IPv6 more than IPv6t. This is on a log scale, so the difference is even bigger than the visualization suggests.
Also, each type of packet tends to have lower packet counts when the size is larger, although the reduction in packet counts differs for each type.
These two main findings can be visualized more easily by showing the combined packet counts for each group: IPv4, IPv6, and IPv6t.
The negative correlation between packet counts and packet sizes does have some exceptions. Specifically, the UDP packets for IPv4 and IPv6 both show somewhat different behavior. For IPv4, the negative correlation breaks down a little when the packet size is close to and beyond 1000, implying that a special type of application using UDP with large packet sizes is present in IPv4. Interestingly, this type of packet might be absent from IPv6. Without an application that uses large UDP packets, the number of large UDP packets drops even more sharply in IPv6.
UDP has a major application in DNS, which uses small packets. It also has a major application in live video streaming, which tends to use large packets for transmission efficiency. This observation suggests that live video streaming could be present in IPv4 but not in IPv6. Additional data not available in this study would be needed to confirm this suspicion.
Exploratory Data Analysis is fun in that you never know what you will find, or whether the data is useful or sufficient to provide insights, until the end of the process. Through the process, we found the following:
This study references various materials publicly available on the Internet.